The only thing you really need to do to prepare for R training is to install the latest version of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in Sections 1.4.1 and 1.4.2 of R for Data Science.
NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!
Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.
The training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. The hour before and the hour after training will also be covered by at least one trainer as office hours, in case there are questions that couldn’t be handled during training.
Each training session has three trainers assigned, two of whom will be the main instructors. The third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have 2 monitors, so you can have the screen being shared on one monitor and your own session of RStudio open on the other.
Four days is barely enough to scratch the surface of what you can do in R. Our goals with this training are to help you get beyond the initial learning curve, which can be really challenging to climb on your own, and to expose you to some of the really useful things R can do for you in your work. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R and it has greatly increased our efficiency and productivity in our jobs. We hope you have a similar experience as you start out on this R learning curve. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.
Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.
Welcome to the class!
Learning (or re-learning) how to program can come with an intimidating learning curve. Congrats on taking the first step! Before we delve into technical topics, let’s talk a little about coding in general.
Errors and broken code are not only part of the learning process, they’re part of the coding process. If you think your code is running perfectly on the first try, you probably didn’t test it thoroughly enough. It can be frustrating and discouraging when you can’t get your code to work the way you want it to, but getting stuck doesn’t make you a bad programmer. Try to approach broken code as a puzzle and a learning opportunity.
As the saying goes,
Good coders write good code; great coders steal great code.
Google and Stack Overflow are great resources when you don’t know off the top of your head how to do something. And reach out to your colleagues too - there’s no reason to waste time writing code that someone else has already written! Just give credit where credit is due, and take some time to make sure you understand what the copied code does. And especially don’t hesitate to reach out to more experienced colleagues with questions.
R is a programming language that was originally developed by statisticians for statistical computing and graphics. R is free and open source, which means you will never need a paid license to use it, and anyone who wants to can view the underlying source code and suggest fixes and improvements. Since its first official release in 1995, R has remained one of the leading programming languages for statistics and data visualization, and its capabilities continue to grow.
Every programming language has strengths and weaknesses. We like R because it is free and open source, it was designed with data analysis and visualization in mind, and it has a huge ecosystem of community-contributed packages that extend its capabilities (for example, reproducible reporting with the rmarkdown package).

For more information on the history of R, visit the R Project website.
RStudio is what’s called an integrated development environment (IDE). When you install R, it comes with a simple user interface that lets you write and execute code. Writing code in this interface is kind of like doing word processing in Notepad: it’s simple and straightforward, but soon you’ll start wishing for more features. This is where RStudio comes in. RStudio makes programming in R easier by color coding different types of code, autocompleting code, flagging mistakes (like spellcheck for code), and providing many useful tools with the push of a button or key stroke (e.g. viewing help info).
When you open RStudio, you typically see 4 panes:
RStudio
Source:
This is where the code gets written. When you create a new script or open an existing one, it displays here. In the screenshot above, there’s a script called bat_data_wrangling.R open in the source pane. Note that if you haven’t yet opened or created a new script, you won’t see this pane until you do.
The source pane will helpfully color-code your code to make it easier to read. It will also detect syntax errors (the coding equivalent of a grammar checker). These are flagged with a red “x” next to the line number and a squiggly line under the offending code. When you’re ready to run all or part of your script, you can do so with the Run button or with keyboard shortcuts (covered below).
Console:
This is where the code actually runs. When you first open RStudio, the console will tell you the version of R that you’re running (should be R 4.1.0 or greater). We talked above about how you can run a script at the console. You can also type code directly into the console and run it that way. That code won’t get saved to a file, but it’s a great way to experiment and test out lines of code before you add them to your script in the source pane.
The console is also where errors appear if your code breaks. Deciphering errors can be a challenge sometimes, but Googling them is a good place to start.
Environment/History/Connections:
The Environment tab lists the variables and data currently loaded into memory, the History tab keeps a record of the commands you have run, and the Connections tab manages connections to databases.
Files/Plots/Packages/Help/Viewer:
The Files tab is a simple file browser, the Plots tab displays plots you generate, the Packages tab lists installed packages, the Help tab shows documentation, and the Viewer tab displays local web content.
There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. Go to Tools -> Global Options and follow the steps below.
In the General tab, set “Save workspace to .RData on exit” to Never and uncheck “Restore .RData into workspace at startup”. This keeps each session reproducible from your scripts rather than from a saved workspace. If you ever do save your workspace, you can type load(".RData") at the console the next time you open RStudio to restore everything. Most other settings are whatever you prefer, including everything in the Appearance window.
When you’re working on a code project it’s important to stay organized. Keeping code, input data, and results together in one place will protect your sanity and the sanity of the person who inherits the project. RStudio projects help with this. Creating a new RStudio project for each new code project makes it easier to manage settings and file paths.
Before we create a project, take a look at the Console tab. Notice that at the top of the console there is a folder path. That path is your current working directory.
default working directory
If you refer to a file in R using a relative path, for example ./data/my_data_file.csv, R will look in your current working directory for a folder called data containing a file called my_data_file.csv. Using relative paths is a good idea because the full path is specific to your computer and likely won’t work on a different computer. But there’s no guarantee that everyone has the same default R working directory. This is where projects come in.
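You can check your working directory from the console at any time. A minimal sketch, using the base R functions getwd() and file.path():

```r
getwd()  # prints the full path of your current working directory

# file.path() builds relative paths that work on any operating system:
file.path("data", "my_data_file.csv")  # "data/my_data_file.csv"
```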
Let’s make a project for this class. Click File > New Project. In the window that appears, select New Directory, then select New Project. You will be prompted for a directory name. This is the name of your project folder. For this class, call it imd-r-training-intro. Next, you’ll select what folder to keep your project folder in. Documents/R is a good place to store all of your R projects but it’s up to you. When you are done, click on Create Project.
If you successfully started a project named imd-r-training-intro, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding. Take a look at the Console tab again. Notice that your current working directory is now your project folder. When you look in the Files tab of the bottom right pane, you’ll see that it also defaults to the project folder.
First we need to create a new R script file. Make sure you are working in the imd-r-training-intro project that you just created. Click on the New File icon in the top left corner. In the dropdown, select R Script. The source pane will appear with an untitled empty script. Go ahead and save it by clicking the Save icon (and make sure the Source on Save checkbox is deselected). Call it my_first_script.R.
We’ll start with something simple. Basic math in R is pretty straightforward and the syntax is similar to simply using a graphing calculator. Try it out! You can use the examples below or come up with your own. Even if you’re using the examples, try to actually type the code instead of copy-pasting - you’ll learn to code faster that way.
To run a single line of code in your script, place your cursor anywhere in that line and press CTRL+ENTER (or click the Run button in the top right of the script pane). To run multiple lines of code, highlight the lines you want to run and hit CTRL+ENTER or click Run.
Text following a hashtag/pound sign (#) will be ignored - use this to leave comments about what your code is doing. Commenting your code is one of the best habits you can form, and you should start now! Comments are a gift to your future self and anyone else who tries to use your code.
# By using this hashtag/pound sign, you are telling R to ignore the text that follows. This is useful for leaving notes or instructions for yourself, or someone else using your code
# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, let's do some basic math. Results of operations will appear here")
## [1] "Hello, let's do some basic math. Results of operations will appear here"
# one plus one
1+1
## [1] 2
# two times three, divided by four
(2*3)/4
## [1] 1.5
# basic mathematical and trigonometric functions are fairly similar to what they would be in excel
# calculate the square root of 9
sqrt(9)
## [1] 3
# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)
## [1] 2
# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)
## [1] -1
# calculate 5 to the tenth power
5^10
## [1] 9765625
Notice that when you run a line of code, the code and the result appear in the console. You can also type code directly into the console, but it won’t be saved anywhere. As you get more comfortable with R, it can be helpful to use the console as a “scratchpad” for experimenting and troubleshooting. For now, it’s best to err on the side of saving your code as a script so that you don’t accidentally lose useful work.
Occasionally, it’s enough to just run a line of code and display the result in the console. But typically the code we write is more complex than adding one plus one, and when we execute a line of code, we want to be able to store the result and use it later in the script. This is where variables come in. Variables allow you to assign a value (whether that’s a number, a data table, a chunk of text, or any other type of data that R can handle) to a short, human-readable name. Anywhere you put a variable in your code, R will replace it with its value when your code runs.
R uses the <- symbol for variable assignment. If you’ve used other programming languages (or remember high school algebra), you may be tempted to use = instead. It will work, but there are subtle differences between <- and =, so you should get in the habit of using <-.
# the value of 12.098 is assigned to variable 'a'
a <- 12.098
# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475
# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:
a*b
## [1] 790.5741
(a^b)/(b+a)
## [1] 7.305156e+68
sqrt((a^7)/(b*2))
## [1] 538.7261
In the code above, we assign the variables a and b once. We can then reuse them as often as we want. This is helpful because we save ourselves some typing, reduce the chances of making a typo somewhere, and if we need to change the value of a or b, we only have to do it in one place.
Also notice that when you assign variables, you can see them listed in your Environment tab (top right pane). Remember, everything you see in the environment is just in R’s temporary memory and won’t be saved when you close out of RStudio.
All of the examples you’ve seen so far are fairly contrived for the sake of simplicity. Let’s take a look at some code that everyone here will make use of at some point: reading data from a CSV.
It’s hard to get very far in R without making use of functions. Think of a function as a black box that takes some kind of input (the argument(s)) from the user and outputs a result (the return value). Can you identify the functions in the above examples?
anatomy of a function
The read.csv function is used to bring in a dataset from a CSV file, and it stores the data in one of the fundamental data structures in R: a data frame. read.csv() takes the file path to the CSV as input, and it outputs a data frame containing the data from the CSV.
We’ll get very familiar with data frames in this class, but for the moment just know that a data frame is a table of data with rows and columns. Data frames are typically organized with rows being records or observations (e.g. sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g., species, size, date collected, x/y coordinates). Once you have read the data in, you can take a quick look at its structure by typing the name of the variable it’s stored in.
# read in the data from tree_growth.csv and assign it as a dataframe to the variable "tree_data"
tree_data <- read.csv("data/tree_growth.csv")
# display the 'tree_data' data frame we just created
tree_data
## Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis 4.0 6 3
## 2 arboreas madeupis 6.0 10 5
## 3 arboreas madeupis 8.0 5 6
## 4 arboreas madeupis 11.0 7 6
## 5 arboreas madeupis 12.0 16 8
## 6 arboreas madeupis 13.0 19 8
## 7 arboreas madeupis 12.0 24 11
## 8 arboreas madeupis 20.0 22 11
## 9 arboreas madeupis 20.0 32 12
## 10 arboreas madeupis 11.0 31 13
## 11 arboreas madeupis 27.0 35 15
## 12 arboreas madeupis 25.0 33 15
## 13 arboreas madeupis 30.0 40 17
## 14 arboreas madeupis 35.0 67 18
## 15 arboreas madeupis 29.0 60 21
## 16 arboreas madeupis 37.0 65 30
## 17 arboreas madeupis 31.0 78 32
## 18 arboreas madeupis 38.0 76 36
## 19 arboreas madeupis 45.0 91 37
## 20 arboreas madeupis 55.0 83 38
## 21 arboreas madeupis 60.0 88 40
## 22 arboreas madeupis 75.0 86 43
## 23 arboreas madeupis 76.0 89 51
## 24 arboreas madeupis 80.0 75 55
## 25 arboreas madeupis 99.0 100 61
## 26 planteus falsinius 9.0 6 2
## 27 planteus falsinius 9.3 9 4
## 28 planteus falsinius 10.1 8 5
## 29 planteus falsinius 10.3 14 4
## 30 planteus falsinius 10.7 12 6
## 31 planteus falsinius 10.8 10 8
## 32 planteus falsinius 11.0 10 7
## 33 planteus falsinius 11.3 11 7
## 34 planteus falsinius 11.4 13 9
## 35 planteus falsinius 11.6 13 12
## 36 planteus falsinius 11.6 15 15
## 37 planteus falsinius 11.9 23 17
## 38 planteus falsinius 12.3 16 18
## 39 planteus falsinius 12.4 21 15
## 40 planteus falsinius 12.6 23 17
## 41 planteus falsinius 12.6 18 21
## 42 planteus falsinius 12.8 25 20
## 43 planteus falsinius 15.0 35 30
Compared to scrolling through a dataset in Excel, examining a dataset in R can feel unintuitive at first. Stick with it, though. Once you get used to exploring data in R, you’ll find that R helps you learn about your dataset much more efficiently.
Let’s see what we can learn about the dataset as a whole. The most useful functions for this are:

- head() Show only the first few lines of the dataframe in the console. Useful for a quick glance at the column names and first rows without printing the whole dataset.
- summary() Show a basic statistical summary of the dataset.
- str() Show information about the structure of the dataset, including number of rows and columns, column data types, and the first few values in each column.
- View() In a new tab, display a spreadsheet-like view of the full dataset with options to sort and filter. Note that you can’t change the data in this view - it is read-only.
head(tree_data) # Show the first few lines of the dataframe. Defaults to showing the first 6 rows
## Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis 4 6 3
## 2 arboreas madeupis 6 10 5
## 3 arboreas madeupis 8 5 6
## 4 arboreas madeupis 11 7 6
## 5 arboreas madeupis 12 16 8
## 6 arboreas madeupis 13 19 8
head(tree_data, n = 10) # Show the first 10 rows
## Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis 4 6 3
## 2 arboreas madeupis 6 10 5
## 3 arboreas madeupis 8 5 6
## 4 arboreas madeupis 11 7 6
## 5 arboreas madeupis 12 16 8
## 6 arboreas madeupis 13 19 8
## 7 arboreas madeupis 12 24 11
## 8 arboreas madeupis 20 22 11
## 9 arboreas madeupis 20 32 12
## 10 arboreas madeupis 11 31 13
summary(tree_data) # View a basic statistical summary
## Species DBH_in Height_ft Age_yr
## Length:43 Min. : 4.00 Min. : 5.00 Min. : 2.00
## Class :character 1st Qu.:11.00 1st Qu.: 12.50 1st Qu.: 7.50
## Mode :character Median :12.60 Median : 23.00 Median :15.00
## Mean :24.78 Mean : 35.35 Mean :18.81
## 3rd Qu.:30.50 3rd Qu.: 62.50 3rd Qu.:25.50
## Max. :99.00 Max. :100.00 Max. :61.00
str(tree_data) # Get info about the structure of the data
## 'data.frame': 43 obs. of 4 variables:
## $ Species : chr "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
## $ DBH_in : num 4 6 8 11 12 13 12 20 20 11 ...
## $ Height_ft: int 6 10 5 7 16 19 24 22 32 31 ...
## $ Age_yr : int 3 5 6 6 8 8 11 11 12 13 ...
View(tree_data) # Open a new tab with a spreadsheet view of the data
We’re going to take a little detour into data structures at this point. It’ll all tie back in to our tree data.
The data frame we just examined is a type of data structure. A data structure is what it sounds like: it’s a structure that holds data in an organized way. There are multiple data structures in R, including vectors, lists, arrays, matrices, data frames, and tibbles (more on this unfortunately-named data structure later). Today we’ll focus on vectors and data frames.
Vectors are the simplest data structure in R. You can think of vectors as one column of data in an Excel spreadsheet, and the elements are each row in the column. Here are some examples of vectors:
digits <- 0:9 # Use x:y to create a sequence of integers starting at x and ending at y
digits
## [1] 0 1 2 3 4 5 6 7 8 9
is_odd <- rep(c(FALSE, TRUE), 5) # Use rep(x, n) to create a vector by repeating x n times
is_odd
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
## [1] 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5
favorite_birds <- c("greater roadrunner", "Costa's hummingbird", "kakapo")
favorite_birds
## [1] "greater roadrunner" "Costa's hummingbird" "kakapo"
Note the use of c(). The c() function stands for combine, and it combines elements into a single vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
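One detail worth knowing: c() flattens its arguments, so combining existing vectors produces a single longer vector rather than a nested one. A quick sketch:

```r
nums <- c(1, 2)
more <- c(nums, 3, 4)  # combines the existing vector with new elements
more                   # 1 2 3 4 - a single flat vector
```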
Let’s play around with vectors a little more. We can use is.vector() to test whether something is a vector. We can get the length of a vector with length(). Note that single values in R are just vectors of length one.
is.vector(digits) # Should be TRUE
## [1] TRUE
is.vector(favorite_birds) # Should also be TRUE
## [1] TRUE
length(digits) # Hopefully this is 10
## [1] 10
length(shoe_sizes)
## [1] 10
# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
## [1] TRUE
is.vector(length_one_int)
## [1] TRUE
length(length_one_chr)
## [1] 1
length(length_one_int)
## [1] 1
In the examples above, each vector contains a different type of data. digits contains integers, is_odd contains logical (true/false) values, favorite_birds contains text, and shoe_sizes contains decimal numbers. That’s because a given vector can only contain a single type of data. In R, there are four data types that we will typically encounter:
"hello", "3", "R is my favorite programming language")23, 3.1415)L to it or use as.integer() (e.g. 5L, as.integer(30)).TRUE, FALSE). Note that TRUE and FALSE must be all-uppercase.There are two more data types, complex and raw, but you are unlikely to encounter these so we won’t cover them here.
You can use the class() function to get the data type of a vector:
class(favorite_birds)
## [1] "character"
class(shoe_sizes)
## [1] "numeric"
class(digits)
## [1] "integer"
class(is_odd)
## [1] "logical"
If you need to access a single element of a vector, you can use the syntax my_vector[x] where x is the element’s index (the number corresponding to its position in the vector). You can also use a vector of indices to extract multiple elements from the vector. Note that in R, indexing starts at 1 (i.e. my_vector[1] is the first element of my_vector). If you’ve coded in other languages, you may be used to indexing starting at 0.
second_favorite_bird <- favorite_birds[2]
second_favorite_bird
## [1] "Costa's hummingbird"
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
## [1] "greater roadrunner" "Costa's hummingbird"
Logical vectors can also be used to subset a vector. The logical vector must be the same length as the vector you are subsetting.
odd_digits <- digits[is_odd]
odd_digits
## [1] 1 3 5 7 9
Let’s revisit our trees data frame. We’ve explored the data frame as a whole, but it’s often useful to look at one column at a time. To do this, we’ll use the $ syntax:
tree_data$Species
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [7] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [10] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [13] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [16] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [19] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [22] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [25] "arboreas madeupis" "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data$Age_yr
## [1] 3 5 6 6 8 8 11 11 12 13 15 15 17 18 21 30 32 36 37 38 40 43 51 55 61
## [26] 2 4 5 4 6 8 7 7 9 12 15 17 18 15 17 21 20 30
You can also use square brackets [] to access data frame columns. You can specify columns by name or by index (integer indicating position of column). It’s almost always best to refer to columns by name when possible - it makes your code easier to read, and it prevents your code from breaking if columns get reordered.
The square bracket syntax allows you to select multiple columns at a time and to select a subset of rows.
tree_data[, "Species"] # Same as tree_data$Species
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [7] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [10] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [13] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [16] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [19] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [22] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [25] "arboreas madeupis" "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data[, c("Species", "Age_yr")] # Data frame with only Species and Age_yr columns
## Species Age_yr
## 1 arboreas madeupis 3
## 2 arboreas madeupis 5
## 3 arboreas madeupis 6
## 4 arboreas madeupis 6
## 5 arboreas madeupis 8
## 6 arboreas madeupis 8
## 7 arboreas madeupis 11
## 8 arboreas madeupis 11
## 9 arboreas madeupis 12
## 10 arboreas madeupis 13
## 11 arboreas madeupis 15
## 12 arboreas madeupis 15
## 13 arboreas madeupis 17
## 14 arboreas madeupis 18
## 15 arboreas madeupis 21
## 16 arboreas madeupis 30
## 17 arboreas madeupis 32
## 18 arboreas madeupis 36
## 19 arboreas madeupis 37
## 20 arboreas madeupis 38
## 21 arboreas madeupis 40
## 22 arboreas madeupis 43
## 23 arboreas madeupis 51
## 24 arboreas madeupis 55
## 25 arboreas madeupis 61
## 26 planteus falsinius 2
## 27 planteus falsinius 4
## 28 planteus falsinius 5
## 29 planteus falsinius 4
## 30 planteus falsinius 6
## 31 planteus falsinius 8
## 32 planteus falsinius 7
## 33 planteus falsinius 7
## 34 planteus falsinius 9
## 35 planteus falsinius 12
## 36 planteus falsinius 15
## 37 planteus falsinius 17
## 38 planteus falsinius 18
## 39 planteus falsinius 15
## 40 planteus falsinius 17
## 41 planteus falsinius 21
## 42 planteus falsinius 20
## 43 planteus falsinius 30
tree_data[1:5, "Species"] # First five elements of Species column
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis"
Now that we’ve learned about vectors, these data frame columns might look familiar. A data frame is just a collection of vectors of the same length, where each vector is a column.
is.vector(tree_data$Species)
## [1] TRUE
class(tree_data$Species)
## [1] "character"
is.vector(tree_data$Age_yr)
## [1] TRUE
class(tree_data$Age_yr)
## [1] "integer"
str(tree_data) # Check out the abbreviations next to the column names: chr, num, and int. These are the data types of the columns.
## 'data.frame': 43 obs. of 4 variables:
## $ Species : chr "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
## $ DBH_in : num 4 6 8 11 12 13 12 20 20 11 ...
## $ Height_ft: int 6 10 5 7 16 19 24 22 32 31 ...
## $ Age_yr : int 3 5 6 6 8 8 11 11 12 13 ...
R is great at working with vectors because they are the fundamental data structure of the language. Since data frame columns are vectors, they are typically easy to operate on.
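This is what “vectorized” means in practice: an arithmetic operation or comparison applies to every element at once, with no loop required. A small sketch using made-up DBH values:

```r
dbh_in <- c(4, 6, 8)     # hypothetical DBH values in inches
dbh_cm <- dbh_in * 2.54  # multiplies every element by 2.54
dbh_cm                   # 10.16 15.24 20.32
dbh_in > 5               # FALSE TRUE TRUE - comparisons are vectorized too
```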
Basic statistical summary functions aren’t too different from what you’d do in Excel.
# to calculate the mean of the 'Age_yr' column in the original dataframe:
mean(tree_data$Age_yr)
## [1] 18.81395
#or to calculate the mean of the DBH vector we created:
dbh <- tree_data$DBH_in
mean(dbh)
## [1] 24.78372
# likewise for the median, standard deviation, and minimum:
median(tree_data$Age_yr)
## [1] 15
sd(tree_data$Age_yr)
## [1] 15.02261
min(tree_data$Age_yr)
## [1] 2
Another useful way to summarize a column is to get the unique values it contains. This is typically most useful with character columns (e.g. species) but works with any data type.
unique(tree_data$Species) # in this line, we are printing all of the unique species names in the dataset (in this case 2).
## [1] "arboreas madeupis" "planteus falsinius"
species_names <- unique(tree_data$Species) # assign unique species names to a variable
Summarizing data is useful but we also want to be able to modify it. So let’s say we want to convert the ‘Height_ft’ column in tree_data from feet to inches. We’ll start by assigning a conversion factor to the variable in_per_ft.
in_per_ft <- 12 # Since there are 12 inches in a foot
We can then use this conversion factor to convert the ‘Height_ft’ column to inches, and place this converted value in a new column we will create in the data frame. Notice below that if you assign a vector to a column that doesn’t exist yet, R will automatically create that column. We’ll use the head() function to verify that the Height_in column was created correctly.
# here we are specifying the new column on the left side of the '<-', and telling R what we want put into it on the right side of the '<-'.
tree_data$Height_in <- tree_data$Height_ft * in_per_ft
# we can now use the head() function to check our work and make sure the proper conversion was carried out:
head(tree_data)
## Species DBH_in Height_ft Age_yr Height_in
## 1 arboreas madeupis 4 6 3 72
## 2 arboreas madeupis 6 10 5 120
## 3 arboreas madeupis 8 5 6 60
## 4 arboreas madeupis 11 7 6 84
## 5 arboreas madeupis 12 16 8 192
## 6 arboreas madeupis 13 19 8 228
It’s often easier to get a sense of our data by visualizing it, so let’s explore the basic tools for taking a first glance at our data.
We’ll start with a histogram. The code below generates a basic histogram of a specific column in the dataframe (remember from earlier that columns are accessed with the $ sign) using the hist() function. We’ll start by creating a histogram for the ‘Age_yr’ column.
hist(x = tree_data$Age_yr)
Looking at the histogram, we can see that the age of most trees in the dataset fall in the 0-20 year old age class.
The data frame also includes information on the DBH and height of the trees. What if we want to quickly take a look at how these two variables relate to one another? To get a basic plot, you can use the plot() function. A simple call to plot() takes two arguments: x (a vector of x coordinates) and y (a vector of y coordinates). You can learn more about plot() by typing ?plot at the console.
# this should generate an ugly but effective scatterplot of tree height vs age. It would appear that older trees are taller!
plot(x = tree_data$Age_yr, y = tree_data$Height_ft)
# let's try the same, but with DBH as the dependent variable on the y axis:
# again, we should see a plot below showing that, even in a make-believe forest, older trees of both species tend to be thicker!
plot(x = tree_data$Age_yr, y = tree_data$DBH_in)
There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel.
?plot
?dplyr::filter
If you have a question, online resources like Stack Exchange and Stack Overflow are extremely helpful. Google searches are a great first step, and it helps to include “in R” in every search related to R code.
Don’t hesitate to reach out to colleagues for help as well! If you are stuck on something and the answers on Google are more confusing than helpful, don’t be afraid to ask a human. Every experienced R programmer was a beginner once, so chances are they’ve encountered the same problem as you at some point. There is an R-focused data science community of practice for I&M folks, which anyone working in R (regardless of experience!) is invited and encouraged to join. The Teams join code will be shared with course participants.
Goals of this Day
Disclaimer: While R provides lots of ways to check and clean up data, these tools are no replacement for sound data management that prevents data issues from happening in the first place.
What is a package? A package is a group of functions that someone has put together. They come with documentation that tells you about what each function does.
First, install the package
install.packages("janitor")
To use a package, call it by using the function “library”
library('janitor')
This is called “loading a package” or “calling a package”.
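As an aside, you can also use a single function from an installed package without loading the whole package, by using the :: operator. A minimal sketch (the package still has to be installed first):

```r
# package::function() calls a function without library()
# here we call janitor's clean_names() directly
# check.names = FALSE preserves the messy column name for the demo
janitor::clean_names(data.frame(`Messy Name` = 1:3, check.names = FALSE))
```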
Look at the vignette
Take time to click a function from the vignette list and walk through how to get helpful info
Did you get “error” text?
Text pops up in the console after you run code. You must decide whether it means you have to change something or not.
library('janitor')
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Challenge: install + call the package ‘tidyverse’
Solution: install + call the package ‘tidyverse’
install.packages("tidyverse")
library('tidyverse')
Data wrangling is all the steps you have to take to prepare your data for analysis or plotting. Data munging and data cleaning are other terms used to describe this process. Material presented today can be found in Hadley Wickham’s book “R for Data Science”, which is free and online: https://r4ds.had.co.nz/wrangle-intro.html
If you’re starting from this tab of the lesson, you’ll need to load two packages before proceeding.
library('janitor')
library('tidyverse')
Major Components to Data Wrangling
Sometimes you get data where the column names are messy. For an example of data that needs tidying before exploration and analyses, we’ll use fake data. Copy + paste the code chunk below into your script, and run it in the console to generate a dataframe called “messy_data”.
Here, we’ll clarify that we are using the word “data.frame” or “dataframe” to describe the data. The word “tibble” also shows up. We’ll use these terms interchangeably.
messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools',
'Monument', 'Tidepools', NA),
Visitor_count = c(200, 300, 400, 500, 400, NA),
`empty row` = NA)
This data.frame captures the hypothetical number of visitors across four trails at a park. There are a few qualities that make it difficult to interpret. Seeing this data, what would you change to make it most consistent?
messy_data
## # A tibble: 6 x 3
## TrailName Visitor_count `empty row`
## <chr> <dbl> <lgl>
## 1 Bayside 200 NA
## 2 Coastal 300 NA
## 3 Tidepools 400 NA
## 4 Monument 500 NA
## 5 Tidepools 400 NA
## 6 <NA> NA NA
There are four things I suggest fixing with these data: the inconsistent column names, the empty column, the empty row, and the duplicated Tidepools row.
The janitor package has ways to fix all of these data situations and aid in data cleaning. Let’s proceed with changing the column names.
The clean_names() function in the janitor package will consistently format column names.
clean_data <- clean_names(messy_data)
clean_data
## # A tibble: 6 x 3
## trail_name visitor_count empty_row
## <chr> <dbl> <lgl>
## 1 Bayside 200 NA
## 2 Coastal 300 NA
## 3 Tidepools 400 NA
## 4 Monument 500 NA
## 5 Tidepools 400 NA
## 6 <NA> NA NA
Now, we’ll deal with the empty row and empty column. To remove empty rows or columns, we can use the remove_empty() function.
The arguments for the function are as follows:
remove_empty(
# first argument is the data
dat,
# the next specifies if you want to get rid of empty rows, columns, or both
which = c("rows", "cols"),
# this last one asks whether (TRUE) or not (FALSE) you want R to tell you which rows or columns were removed
quiet = TRUE)
Let’s give it a try
Removing empty rows
clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
## # A tibble: 5 x 3
## trail_name visitor_count empty_row
## <chr> <dbl> <lgl>
## 1 Bayside 200 NA
## 2 Coastal 300 NA
## 3 Tidepools 400 NA
## 4 Monument 500 NA
## 5 Tidepools 400 NA
Removing empty rows and columns
clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
## # A tibble: 5 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
## 5 Tidepools 400
What happens if you’d like to both clean_names() and remove_empty() on the same data?
Solution 1 - run as a sequence
clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
## 5 Tidepools 400
Solution 2 - use the pipe operator
Pipe operators (%>% or |>) can be read like the word “then”. For example, clean_names() %>% remove_empty() can be read as “clean names, then remove empty”. They allow for a cleaner workflow.
clean_data <- clean_names(messy_data) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
## 5 Tidepools 400
# note that replacing %>% with |> also works
Note that there is no data argument in the remove_empty() function. This is because, when using a pipe, the data argument is inherited from the prior function.
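If it helps, a pipe is just a different way of writing a nested function call. The two forms below should produce identical results (a sketch using the messy_data from above):

```r
# nested form: read inside-out
nested <- remove_empty(clean_names(messy_data),
                       which = c('rows', 'cols'), quiet = TRUE)
# piped form: read left to right, data argument inherited at each step
piped <- messy_data %>%
  clean_names() %>%
  remove_empty(which = c('rows', 'cols'), quiet = TRUE)
identical(nested, piped) # should be TRUE
```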
A final function we’ll discuss is distinct() from the dplyr package, which is part of the tidyverse. It removes duplicate rows from data.
clean_data <- clean_names(messy_data) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
distinct()
clean_data
## # A tibble: 4 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
A word of caution - sometimes you may not want to remove duplicate rows, or may not want to remove empty rows, because they are important to the data. You know your data best. Use functions responsibly.
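If you want to inspect duplicated rows before deciding to drop them, janitor also provides get_dupes(), which returns just the duplicated rows along with a count of how often each occurs. A quick sketch using the data from above:

```r
# get_dupes() with no column arguments checks all columns for duplicates
clean_names(messy_data) %>%
  remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
  get_dupes()
# should show the Tidepools / 400 row appearing twice (dupe_count = 2)
```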
The code chunk below creates messy data, called “messy_data2”. Use functions from the janitor package to tidy the data, and name the result “clean_data2”. See if you can use the pipe operator to streamline the script.
messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
visitor_count = c(NA, NA, 400, 500, 400, 400),
`empty row` = NA)
clean_data2 <- clean_names(messy_data2) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
distinct()
In this section, we’ll introduce functions in the dplyr package. This package is part of the tidyverse, so you won’t have to install and call anything new. The dplyr package is a group of functions that help to manipulate and summarize data. https://www.rdocumentation.org/packages/dplyr/versions/0.7.8
You can think of these functions like verbs that accomplish tasks, and are strung together by the pipe operator (%>% or |>).
We’ll cover the following functions:
The data we’ll use for modules 3-5 is from the Tidy Tuesday community on GitHub. Each week on Tuesday, the community posts data for exploratory data analysis and visualization. You can add data from this page to your RStudio environment by copying + pasting the “Get the data!” code chunk into your script and running it in the console.
https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-17
The data we’ll use is in csv format, so we’ll need to call the readr package, as well as the tidyverse and janitor for this lesson (if you’ve already called tidyverse and janitor, don’t run those lines).
library('readr')
library('janitor')
library('tidyverse')
Next, we’ll load the data from the R for Data Science Tidy Tuesday GitHub page. There are three lines which will load three data.frames.
For now, we’ll focus on the park_visits data.frame, and will circle back to the other two later. There are descriptions of each column for each data.frame on the Tidy Tuesday website linked above if you’d like more information.
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
The dplyr package is part of the tidyverse and is used with data.frames. It makes tasks like filtering and summarizing data easier. Many of the functions within the package are “verb-like”, so they are easy to understand and remember.
select() - keeps the columns you indicate and drops the rest. Since select() puts the columns in the order you type them, you can also use select() to reorder columns.
rename() - keeps all the columns, but renames the ones you indicate
Since we won’t be using geographical information, other than state, in our module, we’ll select some columns from the park_visits data to use today.
Note that the name to the left of the arrow is the resulting data.frame, and the information to the right, connected by pipes, are the steps to get to the result. I normally name the result something different than the original data.frame so that the original data aren’t affected. However, since we won’t need the geographical information later on, I will name this one “park_visits”.
# let's see the column names in park_visits
colnames(park_visits)
## [1] "year" "gnis_id" "geometry"
## [4] "metadata" "number_of_records" "parkname"
## [7] "region" "state" "unit_code"
## [10] "unit_name" "unit_type" "visitors"
# we can also look at the first few rows by using head()
head(park_visits)
## # A tibble: 6 x 12
## year gnis_id geometry metadata number_of_records parkname region state
## <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1904 1163670 POLYGON <NA> 1 Crater Lake PW OR
## 2 1941 1531834 MULTIPOLYGON <NA> 1 Lake Roose~ PW WA
## 3 1961 2055170 MULTIPOLYGON <NA> 1 Lewis and ~ PW WA
## 4 1935 1530459 MULTIPOLYGON <NA> 1 Olympic PW WA
## 5 1982 277263 POLYGON <NA> 1 Santa Moni~ PW CA
## 6 1919 578853 MULTIPOLYGON <NA> 1 <NA> NE ME
## # ... with 4 more variables: unit_code <chr>, unit_name <chr>, unit_type <chr>,
## # visitors <dbl>
# you can either select for the columns you want
park_visits1 <- park_visits %>% select(year, parkname, unit_code, state, visitors)
# or against the columns you don't want, by using the minus sign before the column name
park_visits2 <- park_visits %>% select(-gnis_id, -geometry, -metadata, -number_of_records, -region, -unit_name, -unit_type)
# you can check if you have the same result by using colnames()
colnames(park_visits1)
## [1] "year" "parkname" "unit_code" "state" "visitors"
colnames(park_visits2)
## [1] "year" "parkname" "state" "unit_code" "visitors"
# to clear up our environment, we'll get rid of park_visits1 and 2 using remove()
remove(park_visits1, park_visits2)
# since we won't need all of the columns in park_visits, we'll change that file (note that park_visits is on both the left and right sides of the pipe)
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)
Taking a look at the park_visits data, there are some column name inconsistencies. For example, some are in snake case (unit_code), but some are not (parkname). To rename a column, use rename(). The structure of this function is as follows:
rename(dataset, new_name = old_name) OR dataset %>% rename(new_name = old_name)
# rename parkname column to make it snake case
park_visits <- park_visits %>% rename(park_name = parkname)
# we can check if we've done it correctly by looking at the column names
colnames(park_visits)
## [1] "year" "park_name" "unit_code" "state" "visitors"
filter() - filters rows of a data.frame based on values in columns. You can combine criteria using & (and), | (or), and ! (not)
If you’d like to get visitation data for just parks in the state of Washington (WA), you can use the filter function. The structure is as follows:
wa_park_visits <- filter(
# name of data.frame
park_visits,
# condition
state == 'WA'
)
# head() shows the first few rows of data
head(wa_park_visits)
# you can use the unique() to check you've done the filtering correctly
unique(wa_park_visits$state)
Now that you’ve got parks in the state of Washington, perhaps you’d like to know which parks in WA had over 5 million total visitors. This will require more intensive filtering. We’ll have to filter for parks where state == ‘WA’ AND year == ‘Total’ AND visitors > 5,000,000.
wa_park_visits <- park_visits %>%
# filter for parks in WA
filter(state == 'WA' &
# filter for total visitation
year == 'Total' &
# filter for over 5 million visitors
visitors > 5000000)
# head() shows the first few rows of data
head(wa_park_visits)
You also might be wondering why we use “==” instead of “=”. In R, the single equal sign “=” is used to assign something. For example, if you want to make a variable equal to a number, you could tell R “x = 5”, and it would remember that x is equal to 5. To evaluate something as TRUE/FALSE, you use “==”. If you type “5 == 3” into the console, it will return “FALSE”. The filter() function uses this TRUE/FALSE evaluation, keeping only the rows where the stated condition - such as year == “Total” - is TRUE.
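You can test this distinction directly at the console:

```r
x <- 5   # single = (or, preferably, <-) assigns a value
x == 5   # double == tests equality; returns TRUE
5 == 3   # returns FALSE
x = 10   # also assignment (though <- is the R convention)
x == 5   # now returns FALSE
```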
Time for the first challenge. Using the park_visits data, create a new data.frame containing total visitation for a state of your choice. Note that some “states” aren’t states. For example, VI refers to the Virgin Islands. How can you check your work?
pr_park_visits <- park_visits %>%
# filter for parks in Puerto Rico
filter(state == 'PR' &
# filter for total visitation
year == 'Total')
# head() shows the first few rows of data
head(pr_park_visits)
# San Juan National Historic Site (Puerto Rico) had > 55 million visitors!
# remove result from environment
remove(pr_park_visits)
arrange() reorders rows. You can think of this like “sort” in Excel. While this is generally not needed in R, it can be helpful for exploring your data, and when working with time series data and lagged values.
As an example, we’ll re-visit total park visitation in WA, sorting by the number of visitors.
wa_park_visits <- park_visits %>%
# filter for parks in WA
filter(state == 'WA' &
# filter for total visitation
year == 'Total' &
# filter for over 5 million visitors
visitors > 5000000) %>%
# arrange by visitation - default is ascending order
arrange(visitors)
# look at result (df is short so head() not needed)
wa_park_visits
## # A tibble: 9 x 5
## year park_name unit_code state visitors
## <chr> <chr> <chr> <chr> <dbl>
## 1 Total Lewis and Clark LEWI WA 9515314
## 2 Total <NA> NEPE WA 9566104
## 3 Total Ross Lake ROLA WA 11335924
## 4 Total North Cascades NOCA WA 14809679
## 5 Total Fort Vancouver FOVA WA 20238294
## 6 Total San Juan Island SAJH WA 23066952
## 7 Total Lake Roosevelt LARO WA 62247752
## 8 Total Mount Rainier MORA WA 94488801
## 9 Total Olympic OLYM WA 160737962
# to arrange by descending order, we can use desc()
wa_park_visits2 <- wa_park_visits %>%
arrange(desc(visitors))
# look at result
wa_park_visits2
## # A tibble: 9 x 5
## year park_name unit_code state visitors
## <chr> <chr> <chr> <chr> <dbl>
## 1 Total Olympic OLYM WA 160737962
## 2 Total Mount Rainier MORA WA 94488801
## 3 Total Lake Roosevelt LARO WA 62247752
## 4 Total San Juan Island SAJH WA 23066952
## 5 Total Fort Vancouver FOVA WA 20238294
## 6 Total North Cascades NOCA WA 14809679
## 7 Total Ross Lake ROLA WA 11335924
## 8 Total <NA> NEPE WA 9566104
## 9 Total Lewis and Clark LEWI WA 9515314
# remove one result
remove(wa_park_visits2)
pull() takes data from a column and returns it in vector form
pull() %>% table() is a handy way to see a summary of categorical data (e.g. how many observations for each state there are in your data)
# two ways to get the number of observations per state
# option 1 - use pull() to isolate a column and table() to get # of observations
state_n <- park_visits %>%
# pull the state column only
pull(state) %>%
# get # of observations by state in table output
table()
# option 2 - use group_by() to isolate column(s) and tally() to get # of obs
state_n2 <- park_visits %>%
# group by state
group_by(state) %>%
# get tally of observations by state in output
tally()
# remove created objects
remove(state_n, state_n2)
mutate() makes new columns by doing calculations on old columns
As an example, let’s calculate the total visitation for parks with over 5 million total visitors in WA in millions, by dividing the number of total visitors by 1,000,000.
wa_park_visits_millions <- wa_park_visits %>%
mutate(visitors_mil = visitors/1000000)
wa_park_visits_millions
## # A tibble: 9 x 6
## year park_name unit_code state visitors visitors_mil
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Total Lewis and Clark LEWI WA 9515314 9.52
## 2 Total <NA> NEPE WA 9566104 9.57
## 3 Total Ross Lake ROLA WA 11335924 11.3
## 4 Total North Cascades NOCA WA 14809679 14.8
## 5 Total Fort Vancouver FOVA WA 20238294 20.2
## 6 Total San Juan Island SAJH WA 23066952 23.1
## 7 Total Lake Roosevelt LARO WA 62247752 62.2
## 8 Total Mount Rainier MORA WA 94488801 94.5
## 9 Total Olympic OLYM WA 160737962 161.
group_by() indicates that data is in groups, so you can do calculations separately for each group
group_by() %>% mutate() will add a column and return the same number of rows as the original data
This is helpful when you want to get group summary information without sacrificing the original values. For example, if we want to know what percent of the total visitation of a park occurs in each year, and add this to the original data.
park_visits2 <- park_visits %>%
# remove the rows with Totals per park
filter(year != "Total") %>%
# do calculations by park
group_by(park_name) %>%
# add visit_percent column
# visits for each year divided by the sum of total visits for each park
# the na.rm = TRUE means that NA values are not included in calculations
# round(2) rounds the result to 2 decimal places for easier reading
mutate(visit_percent = (100*visitors/sum(visitors, na.rm = TRUE)) %>%
round(2))
# take a look at the result
head(park_visits2)
## # A tibble: 6 x 6
## # Groups: park_name [6]
## year park_name unit_code state visitors visit_percent
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1904 Crater Lake CRLA OR 1500 0
## 2 1941 Lake Roosevelt LARO WA 0 0
## 3 1961 Lewis and Clark LEWI WA 69000 0.73
## 4 1935 Olympic OLYM WA 2200 0
## 5 1982 Santa Monica Mountains SAMO CA 468144 2.66
## 6 1919 <NA> ACAD ME 64000 0
# remove from environment
remove(park_visits2)
group_by() and summarize() (or summarise()) also work together on grouped data. Here you calculate a summary for each group. The output data frame will have the columns that define the groups plus the summary columns, with one row for each group.
I find this combination most helpful for calculating means and standard deviations for data. You can also find minimum and maximum values. We’ll filter for total visitation numbers for each park, then calculate a visitation summary for each state.
state_summary <- park_visits %>%
# filter for total visitation numbers (to avoid double counts)
filter(year == 'Total') %>%
# do calculations by state
group_by(state) %>%
# calculate summaries
summarize(
# mean number of total visitors across parks in each state
mean_visit = mean(visitors, na.rm = TRUE),
# sd of total visitors across parks in each state
sd_visit = sd(visitors, na.rm = TRUE),
# min total visitors across parks in each state
min_visit = min(visitors, na.rm = TRUE),
# max total visitors across parks in each state
max_visit = max(visitors, na.rm = TRUE),
# get number of park units w data in each state
n_parks = n()
)
# take a look at the result
head(state_summary)
## # A tibble: 6 x 6
## state mean_visit sd_visit min_visit max_visit n_parks
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AK 4661539 6737110. 10490 19476522 22
## 2 AL 2998848. 2162876. 365883 5487784 5
## 3 AR 20647567. 34237894. 60563 92028933 7
## 4 AS 113694 NA 113694 113694 1
## 5 AZ 23655713. 46310344. 348811 205486894 19
## 6 CA 61322519. 120602434. 6349 611031225 26
# if you want to continue analyzing this summarized data, you must first ungroup()
# for example:
state_summary <- state_summary %>%
ungroup() %>%
mutate(dif = max_visit - min_visit)
# remove from environment
remove(state_summary)
For this second challenge, you’ll start with the park_visits data. Calculate the average visitation divided by the number of parks for each state. The final data.frame should be arranged in descending order by a numeric column of your choice.
challenge2 <- park_visits %>%
# filter for total visitation numbers (to avoid double counts)
filter(year == 'Total') %>%
# do calculations by state
group_by(state) %>%
# calculate summaries
summarize(
# mean number of total visitors across parks in each state
mean_visit = mean(visitors, na.rm = TRUE),
# get number of park units w data in each state
n_parks = n()
) %>%
# ungroup to do another calculation
ungroup() %>%
# calculate visit/n
mutate(visit_per_park = mean_visit/n_parks) %>%
# select relevant columns
select(state, visit_per_park) %>%
# arrange in descending order
arrange(desc(visit_per_park))
# take a look at the result
head(challenge2)
## # A tibble: 6 x 2
## state visit_per_park
## <chr> <dbl>
## 1 NV 103867788.
## 2 PR 55928192
## 3 ME 40890226.
## 4 NC 19591218.
## 5 MS 19160525
## 6 OK 18206483.
# remove from environment
remove(challenge2)
This section will go over the use of functions to change data from wide to long format and vice-versa.
We will:
Use pivot_longer()
Use pivot_wider()
If you’re starting with this section, run the following code chunk to get caught up.
# call packages
library("readr")
library("janitor")
library("tidyverse")
# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
# cut some columns from park_visits data
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)
Depending on your goal, you may want to have your data formatted in multiple ways. For example, it is much easier for humans to read wide format data, while most R functions rely on long format data. Here is a quick demonstration using park visits data:
# we'll first take a subset of the park visits data
medn_visits <- park_visits %>%
# get parks in MEDN and no totals, years 2010-2015
filter(unit_code %in% c("CABR", "CHIS", "SAMO") & year != "Total") %>%
# make year a number, since no more text
mutate(year = as.numeric(year)) %>%
# get years 2010:2015
filter(year >= 2010 & year <= 2015) %>%
# arrange in ascending order
arrange(year) %>%
# select relevant columns
select(unit_code, year, visitors)
# if we take a look at the data, it is in long format, each observation is a row
# this isn't the most human-friendly way to look at data, but it is really helpful if we want to use dplyr group_by() functions like mutate() and summarize()
medn_visits
## # A tibble: 18 x 3
## unit_code year visitors
## <chr> <dbl> <dbl>
## 1 SAMO 2010 568371
## 2 CABR 2010 763140
## 3 CHIS 2010 277515
## 4 SAMO 2011 609636
## 5 CABR 2011 813351
## 6 CHIS 2011 242756
## 7 SAMO 2012 649471
## 8 CABR 2012 877951
## 9 CHIS 2012 249594
## 10 SAMO 2013 633054
## 11 CABR 2013 856128
## 12 CHIS 2013 212029
## 13 SAMO 2014 694714
## 14 CABR 2014 893434
## 15 CHIS 2014 342161
## 16 SAMO 2015 797126
## 17 CABR 2015 981825
## 18 CHIS 2015 324816
# conversely, we can pivot wider to make the data wide format
# it is more human-friendly but you can't use dplyr functions on it easily
medn_visits_wide <- pivot_wider(medn_visits, names_from = year, values_from = visitors)
medn_visits_wide
## # A tibble: 3 x 7
## unit_code `2010` `2011` `2012` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SAMO 568371 609636 649471 633054 694714 797126
## 2 CABR 763140 813351 877951 856128 893434 981825
## 3 CHIS 277515 242756 249594 212029 342161 324816
There are two functions we’ll discuss today:
pivot_wider() : converts from long -> wide
pivot_longer() : converts from wide -> long
Full disclosure, I almost always have to google the arguments that go into the pivot_ functions. Here’s the website for pivot_wider(): https://tidyr.tidyverse.org/reference/pivot_wider.html
Here’s a short example - pivoting the MEDN visits from long to wide, with more detail than the prior code chunk.
# check out the original data (medn_visits) in long format
medn_visits
## # A tibble: 18 x 3
## unit_code year visitors
## <chr> <dbl> <dbl>
## 1 SAMO 2010 568371
## 2 CABR 2010 763140
## 3 CHIS 2010 277515
## 4 SAMO 2011 609636
## 5 CABR 2011 813351
## 6 CHIS 2011 242756
## 7 SAMO 2012 649471
## 8 CABR 2012 877951
## 9 CHIS 2012 249594
## 10 SAMO 2013 633054
## 11 CABR 2013 856128
## 12 CHIS 2013 212029
## 13 SAMO 2014 694714
## 14 CABR 2014 893434
## 15 CHIS 2014 342161
## 16 SAMO 2015 797126
## 17 CABR 2015 981825
## 18 CHIS 2015 324816
# if we want to make each row a park and each column a year, we can pivot wider
medn_visits_wide <- medn_visits %>%
pivot_wider(
# names from is the thing you want to be the columns
names_from = year,
# values from is what you want to fill the columns with
values_from = visitors
)
# check out the result
medn_visits_wide
## # A tibble: 3 x 7
## unit_code `2010` `2011` `2012` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SAMO 568371 609636 649471 633054 694714 797126
## 2 CABR 763140 813351 877951 856128 893434 981825
## 3 CHIS 277515 242756 249594 212029 342161 324816
# we can also make park units the columns and fill it with visitor data for each year
medn_visits_wide <- medn_visits %>%
pivot_wider(names_from = unit_code, values_from = visitors)
medn_visits_wide
## # A tibble: 6 x 4
## year SAMO CABR CHIS
## <dbl> <dbl> <dbl> <dbl>
## 1 2010 568371 763140 277515
## 2 2011 609636 813351 242756
## 3 2012 649471 877951 249594
## 4 2013 633054 856128 212029
## 5 2014 694714 893434 342161
## 6 2015 797126 981825 324816
Now that we have our data in wide format, it is time to revert back to pivoting longer. Here’s the reference material for pivot_longer() https://tidyr.tidyverse.org/reference/pivot_longer.html.
# check out the original data (medn_visits_wide) in wide format
medn_visits_wide
## # A tibble: 6 x 4
## year SAMO CABR CHIS
## <dbl> <dbl> <dbl> <dbl>
## 1 2010 568371 763140 277515
## 2 2011 609636 813351 242756
## 3 2012 649471 877951 249594
## 4 2013 633054 856128 212029
## 5 2014 694714 893434 342161
## 6 2015 797126 981825 324816
# if we want to make each row an observation, with an associated park and year, we'll pivot_longer
medn_visits_long <- medn_visits_wide %>%
pivot_longer(
# first argument is the columns you'd like to transform
SAMO:CHIS,
# next is what you'd like to name the former name cols
names_to = "unit_code",
# last is what you'd like to name the former values cols
values_to = "visitors"
)
# check out the result
medn_visits_long
## # A tibble: 18 x 3
## year unit_code visitors
## <dbl> <chr> <dbl>
## 1 2010 SAMO 568371
## 2 2010 CABR 763140
## 3 2010 CHIS 277515
## 4 2011 SAMO 609636
## 5 2011 CABR 813351
## 6 2011 CHIS 242756
## 7 2012 SAMO 649471
## 8 2012 CABR 877951
## 9 2012 CHIS 249594
## 10 2013 SAMO 633054
## 11 2013 CABR 856128
## 12 2013 CHIS 212029
## 13 2014 SAMO 694714
## 14 2014 CABR 893434
## 15 2014 CHIS 342161
## 16 2015 SAMO 797126
## 17 2015 CABR 981825
## 18 2015 CHIS 324816
# this should look familiar - like the original medn_visits data
This is a more involved challenge that will draw on your knowledge of both dplyr (from Module 3) and pivoting.
Using the state_pop data, accomplish the following tasks:
Filter for three states of your choice for the years 2010-2015
Pivot the data to wide format
Pivot back, so you end with a data.frame in long format
Hint: if your column names are numbers, like years, you must refer to them using back-ticks. A column called 2020 would be `2020`
# get 3 states of choice (FL, CA, AK) for years 2010-2015
solution_original <- state_pop %>%
filter(state %in% c("FL", "AK", "CA") &
year >= 2010 & year <= 2015)
# pivot wider
solution_wide <- solution_original %>%
pivot_wider(
names_from = "year",
values_from = "pop"
)
# pivot longer to return to format of solution_original
solution_long <- solution_wide %>%
pivot_longer(`2010`:`2015`, names_to = "year", values_to = "pop")
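One thing to watch for: because column names are text, pivot_longer() returns year as a character column, even though it was numeric in the original data. You can restore the numeric type while pivoting with the names_transform argument. A sketch, assuming the solution_wide object from above:

```r
# convert the year column back to integer as part of the pivot
solution_long2 <- solution_wide %>%
  pivot_longer(`2010`:`2015`,
               names_to = "year",
               values_to = "pop",
               # names_transform converts the new names column as it is created
               names_transform = list(year = as.integer))
```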
Oftentimes we want to use data that is split into two or more data.frames. This section will go over the use of functions to “join” or combine data from two tables into one.
We will:
Use bind_rows() and bind_cols() for simple cases
Use dplyr commands for more complicated joins:
left_join() and right_join()
full_join()
inner_join()
semi_join()
anti_join()
If you’re starting with this section, run the following code chunk to get caught up.
# call packages
library("readr")
library("janitor")
library("tidyverse")
# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
# cut some columns from park_visits data and make sure year is numeric data only
park_visits <- park_visits %>%
select(year, parkname, unit_code, state, visitors) %>%
filter(year != "Total") %>%
mutate(year = as.integer(year))
bind_rows() and bind_cols() simply merge two data.frames directly.
bind_rows()
bind_rows() adds the rows in the second data.frame to the bottom of the first data.frame. This generally only makes sense if they have many identical columns. Make sure that columns with the same data have the same name!
# we'll first make two subsets of the park_visits data
SHEN_visits <- park_visits %>%
# get data from Shenandoah.
filter(unit_code == "SHEN")
SHEN_visits
## # A tibble: 81 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1936 <NA> SHEN VA 694098
## 2 2016 <NA> SHEN VA 1437341
## 3 2015 <NA> SHEN VA 1321873
## 4 2014 <NA> SHEN VA 1255321
## 5 2013 <NA> SHEN VA 1136505
## 6 2012 <NA> SHEN VA 1210200
## 7 2011 <NA> SHEN VA 1209883
## 8 2010 <NA> SHEN VA 1253386
## 9 2009 <NA> SHEN VA 1120981
## 10 2008 <NA> SHEN VA 1075878
## # ... with 71 more rows
ACAD_visits <- park_visits %>%
# get data from Acadia.
filter(unit_code == "ACAD")
ACAD_visits
## # A tibble: 98 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1919 <NA> ACAD ME 64000
## 2 2016 <NA> ACAD ME 3303393
## 3 2015 <NA> ACAD ME 2811184
## 4 2014 <NA> ACAD ME 2563129
## 5 2013 <NA> ACAD ME 2254922
## 6 2012 <NA> ACAD ME 2431052
## 7 2011 <NA> ACAD ME 2374645
## 8 2010 <NA> ACAD ME 2504208
## 9 2009 <NA> ACAD ME 2227698
## 10 2008 <NA> ACAD ME 2075857
## # ... with 88 more rows
# now we will combine the 2 data.frames into one.
SHEN_ACAD_visits <- bind_rows(SHEN_visits, ACAD_visits)
# check that data from both parks is there
SHEN_ACAD_visits %>%
pull(unit_code) %>%
table()
## .
## ACAD SHEN
## 98 81
rm(SHEN_ACAD_visits, SHEN_visits, ACAD_visits)
You generally would not split up a data.frame and then put it together again like this, but it can be useful if you have data that has the same data structure but starts out broken up by site, date, park etc. and you wish to combine it into one data.frame.
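bind_rows() can also take a list of data.frames, which is handy when data arrive as one file per park, site, or year. The optional .id argument adds a column recording which list element each row came from. A sketch, re-making the two subsets (they were removed above):

```r
# recreate two small subsets of park_visits
SHEN_visits <- filter(park_visits, unit_code == "SHEN")
ACAD_visits <- filter(park_visits, unit_code == "ACAD")
# .id adds a "source" column naming the list element each row came from
both <- bind_rows(list(shen = SHEN_visits, acad = ACAD_visits), .id = "source")
# check how many rows came from each source
table(both$source)
rm(SHEN_visits, ACAD_visits, both)
```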
bind_cols()
bind_cols() adds the columns from the second data.frame to the right of the first data.frame.
# we'll first make two subsets of the park_visits data
visits_left <- park_visits %>%
# get first 2 columns.
select(year, parkname)
visits_left
## # A tibble: 21,174 x 2
## year parkname
## <int> <chr>
## 1 1904 Crater Lake
## 2 1941 Lake Roosevelt
## 3 1961 Lewis and Clark
## 4 1935 Olympic
## 5 1982 Santa Monica Mountains
## 6 1919 <NA>
## 7 1969 <NA>
## 8 1967 <NA>
## 9 1944 <NA>
## 10 1989 <NA>
## # ... with 21,164 more rows
visits_right <- park_visits %>%
# get last 3 columns.
select(unit_code, state, visitors)
visits_right
## # A tibble: 21,174 x 3
## unit_code state visitors
## <chr> <chr> <dbl>
## 1 CRLA OR 1500
## 2 LARO WA 0
## 3 LEWI WA 69000
## 4 OLYM WA 2200
## 5 SAMO CA 468144
## 6 ACAD ME 64000
## 7 AMIS TX 448000
## 8 ASIS MD 738700
## 9 BIBE TX 1409
## 10 BICY FL 81157
## # ... with 21,164 more rows
visits_bound <- bind_cols(visits_left, visits_right)
visits_bound
## # A tibble: 21,174 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1904 Crater Lake CRLA OR 1500
## 2 1941 Lake Roosevelt LARO WA 0
## 3 1961 Lewis and Clark LEWI WA 69000
## 4 1935 Olympic OLYM WA 2200
## 5 1982 Santa Monica Mountains SAMO CA 468144
## 6 1919 <NA> ACAD ME 64000
## 7 1969 <NA> AMIS TX 448000
## 8 1967 <NA> ASIS MD 738700
## 9 1944 <NA> BIBE TX 1409
## 10 1989 <NA> BICY FL 81157
## # ... with 21,164 more rows
rm(visits_left, visits_right, visits_bound)
bind_cols() assumes that the rows are in the same order in each data.frame, and that there are no rows in one data.frame that are missing from the other. Unless you are sure both of these are true, you should avoid using this function.
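To make the risk concrete, here is a small sketch with toy data (not the park_visits dataset): when the second data.frame is sorted differently, bind_cols() pairs the wrong rows together without any warning or error.

```r
library(dplyr)

parks <- data.frame(unit_code = c("ACAD", "SHEN", "CRLA"),
                    state = c("ME", "VA", "OR"))
# Same parks, but in a different row order
visits <- data.frame(unit_code = c("CRLA", "ACAD", "SHEN"),
                     visitors = c(1500, 64000, 694098))

# bind_cols() blindly pairs row 1 with row 1, row 2 with row 2, ...
wrong <- bind_cols(parks, visits["visitors"])
wrong
# ACAD is silently paired with CRLA's visitor count.
```

A join on unit_code (covered next) would pair the rows correctly regardless of their order.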
bind_rows() challenge
Using the park_visits data, make 2 smaller data.frames: one with the Atlantic island territories (state codes "PR" and "VI") and one with the Pacific island territories (state codes "AS", "GU", and "HI").
Then, using bind_rows() combine the 2 data.frames.
# Atlantic islands
Atlantic <- park_visits %>%
filter(state %in% c("PR", "VI"))
# Pacific Islands
Pacific <- park_visits %>%
filter(state %in% c("AS", "GU", "HI"))
# All islands
Islands <- bind_rows(Atlantic, Pacific)
rm(Atlantic, Pacific, Islands)
left_join() and right_join()
left_join() and right_join() are a safer way to join columns from one data.frame to the columns in another.
For either of these functions, you need to specify one or more key columns using the by argument. Key columns contain the data used to determine which rows in one data.frame match rows in the other. The order of the rows in the data.frames does not matter.
The syntax is left_join(x, y, by = "name"), where x and y are data.frames.
The only difference between left_join() and right_join() is what happens when the rows in the data.frames don’t match up one to one. left_join() keeps all the rows in x and drops any rows in y that don’t match, while right_join() keeps all the rows in y and drops any rows in x that don’t match.
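As a tiny sketch of that difference (toy data, not the training datasets), note how each join keeps the unmatched years from only one side:

```r
library(dplyr)

x <- data.frame(year = c(1998, 1999, 2000), pop = c(10, 11, 12))
y <- data.frame(year = c(1999, 2000, 2001), gas = c(1.1, 1.5, 1.4))

lj <- left_join(x, y, by = "year")   # keeps 1998-2000; 1998 gets NA for gas
rj <- right_join(x, y, by = "year")  # keeps 1999-2001; 2001 gets NA for pop
lj
rj
```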
Let’s combine the gas_price data with the state_pop data. The gas price data does not vary by state, so we just have to match the data by year. In order to make the way these functions work a little clearer, we will first filter the state_pop data to only include data before 2000, so that there is less of an overlap between the 2 data sources.
# filter the state data
state_pop2 <- state_pop %>% filter(year < 2000)
# left_join example
state_pop_gas1 <- left_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas1
## # A tibble: 5,100 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1900 AL 1830000 NA NA
## 2 1901 AL 1907000 NA NA
## 3 1902 AL 1935000 NA NA
## 4 1903 AL 1957000 NA NA
## 5 1904 AL 1978000 NA NA
## 6 1905 AL 2012000 NA NA
## 7 1906 AL 2045000 NA NA
## 8 1907 AL 2058000 NA NA
## 9 1908 AL 2070000 NA NA
## 10 1909 AL 2108000 NA NA
## # ... with 5,090 more rows
A couple of things to notice. The state_pop_gas1 data.frame has 5100 rows because that is how many rows are in the state_pop2 data.frame. Every year is repeated 51 times (once for each state + DC), so the gas price data for a given year gets repeated 51 times as well. The gas_price data only goes back to 1929, so for years before that, the data just has NA for gas prices.
Here is the exact same thing, but this time using right_join()
# right_join example
state_pop_gas2 <- right_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas2
## # A tibble: 3,637 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1929 AL 2644000 0.21 2.38
## 2 1930 AL 2647000 0.2 2.3
## 3 1931 AL 2649000 0.17 2.18
## 4 1932 AL 2653000 0.18 2.61
## 5 1933 AL 2661000 0.18 2.66
## 6 1934 AL 2685000 0.19 2.67
## 7 1935 AL 2719000 0.19 2.62
## 8 1936 AL 2743000 0.19 2.67
## 9 1937 AL 2762000 0.2 2.63
## 10 1938 AL 2787000 0.2 2.64
## # ... with 3,627 more rows
Now the data is very different. It has 3637 rows, which seems pretty random. What has happened is that right_join() kept every row of gas_price: years from 1929 through 1999 match the state_pop2 data, so each of those years is repeated 51 times (once per state + DC), while years that appear only in gas_price (2000 and later) show up once each, with NA for state and pop. So all the data from gas_price is present, repeated where necessary to match up with the state_pop2 data, and the pre-1929 rows of state_pop2 were dropped.
inner_join() and full_join()
In the previous section we used left and right joins to get all of the data from one data.frame with matching data from a second data.frame. Now we will look at 2 other joins:
inner_join() - only rows that have matches in both data.frames are kept
full_join() - you get all the rows from both datasets, even if some don’t match
inner join:
# inner_join example
state_pop_gas3 <- inner_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas3
## # A tibble: 3,621 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1929 AL 2644000 0.21 2.38
## 2 1930 AL 2647000 0.2 2.3
## 3 1931 AL 2649000 0.17 2.18
## 4 1932 AL 2653000 0.18 2.61
## 5 1933 AL 2661000 0.18 2.66
## 6 1934 AL 2685000 0.19 2.67
## 7 1935 AL 2719000 0.19 2.62
## 8 1936 AL 2743000 0.19 2.67
## 9 1937 AL 2762000 0.2 2.63
## 10 1938 AL 2787000 0.2 2.64
## # ... with 3,611 more rows
Now the data.frame has 3621 rows: 71 years of data from each of 51 states (71 x 51 = 3621). This is the smallest data.frame produced by these joins. The years in state_pop_gas3 are only those found in both data.frames.
full join:
# full_join example
state_pop_gas4 <- full_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas4
## # A tibble: 5,116 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1900 AL 1830000 NA NA
## 2 1901 AL 1907000 NA NA
## 3 1902 AL 1935000 NA NA
## 4 1903 AL 1957000 NA NA
## 5 1904 AL 1978000 NA NA
## 6 1905 AL 2012000 NA NA
## 7 1906 AL 2045000 NA NA
## 8 1907 AL 2058000 NA NA
## 9 1908 AL 2070000 NA NA
## 10 1909 AL 2108000 NA NA
## # ... with 5,106 more rows
rm(state_pop_gas1, state_pop_gas2, state_pop_gas3, state_pop_gas4)
This data.frame has 5116 rows, the most of any of these joins. It includes all the data from either data.frame, regardless of whether there is a match.
Using the park_visits and state_pop data, create a new data.frame with all the data from park_visits and any matching data from state_pop.
Hint: if you need to use 2 columns as keys, you write the by argument like so: by = c("Column1", "Column2").
park_visits_pop <- left_join(park_visits, state_pop, by = c("state", "year"))
rm(park_visits_pop)
Sometimes you don’t want to combine two tables, but you do want to filter one table so that it only has data that can match the data in another table. semi_join() is used for this.
semi join:
# semi_join example
state_pop_gas5 <- semi_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas5
## # A tibble: 3,621 x 3
## year state pop
## <dbl> <chr> <dbl>
## 1 1929 AL 2644000
## 2 1930 AL 2647000
## 3 1931 AL 2649000
## 4 1932 AL 2653000
## 5 1933 AL 2661000
## 6 1934 AL 2685000
## 7 1935 AL 2719000
## 8 1936 AL 2743000
## 9 1937 AL 2762000
## 10 1938 AL 2787000
## # ... with 3,611 more rows
The columns in state_pop_gas5 are the same as those in state_pop2. However, all the data from years prior to 1929 has been removed, because pre-1929 data is not in the gas_price data.
anti_join() is the opposite of semi_join(). It will keep the data in one data.frame that does not have a match in the second data.frame.
anti_join:
# anti_join example
state_pop_gas6 <- anti_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas6
## # A tibble: 1,479 x 3
## year state pop
## <dbl> <chr> <dbl>
## 1 1900 AL 1830000
## 2 1901 AL 1907000
## 3 1902 AL 1935000
## 4 1903 AL 1957000
## 5 1904 AL 1978000
## 6 1905 AL 2012000
## 7 1906 AL 2045000
## 8 1907 AL 2058000
## 9 1908 AL 2070000
## 10 1909 AL 2108000
## # ... with 1,469 more rows
# clean up
rm(state_pop2, state_pop_gas5, state_pop_gas6)
state_pop_gas6 only has 1479 rows, because it only contains data from before 1929, when the gas_price data starts.
Filter the park_visits data so that it only contains rows that can be joined with both the state_pop data and the gas_price data.
Note that the state_pop data has some NAs for population for Alaska and Hawaii prior to 1949. For a bigger challenge, don’t include those state/year combinations in your final data set.
Hint: the function is.na(data) will be TRUE when a value is NA. If you put an exclamation mark in front, !is.na(data), it will tell you when the values are not NA.
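A quick illustration with a toy vector:

```r
pop <- c(128000, NA, 432000)

is.na(pop)       # FALSE  TRUE FALSE
!is.na(pop)      # TRUE  FALSE  TRUE

# Use the logical vector to keep only the non-missing values
pop_clean <- pop[!is.na(pop)]
pop_clean        # 128000 432000
```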
## In this case we filter out the NA values inside the semi_join() call. You could also filter them out first, save the result to a new data.frame, and then do the join.
park_visits_filtered <- semi_join(park_visits, state_pop %>% filter(!is.na(pop)),
                                  by = c("state", "year")) %>%
  semi_join(gas_price, by = "year")
park_visits_filtered
## # A tibble: 19,964 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1941 Lake Roosevelt LARO WA 0
## 2 1961 Lewis and Clark LEWI WA 69000
## 3 1935 Olympic OLYM WA 2200
## 4 1982 Santa Monica Mountains SAMO CA 468144
## 5 1969 <NA> AMIS TX 448000
## 6 1967 <NA> ASIS MD 738700
## 7 1944 <NA> BIBE TX 1409
## 8 1989 <NA> BICY FL 81157
## 9 1988 <NA> BISO TN 613011
## 10 1941 <NA> BLRI NC 895874
## # ... with 19,954 more rows
rm(park_visits_filtered)
If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding conventions, and each package rarely had all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and doing the rest in ArcGIS… Enter the sf and tmap packages, the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved by these two packages.
The sf package is the workhorse for anything you need to do with spatial vector data. Spatial objects in sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R and that most GIS-based packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the sp and rgdal packages, which used to be the more common approach to spatial data in R and open source GIS. The more I use this package, the more impressed I am with how intuitive it is to use and how well documented the functions are. For vector data, I have yet to find a task that sf couldn’t do.
The main application of the tmap package is making maps, and it was designed with a grammar of graphics philosophy similar to ggplot2. In practice, this means that tmap maps are built as a collection of many small functions/layers that are added together, and order matters. There are also tmap-enabled functions that you can use in ggplot2 plots, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, an add-on package for plotting spatial data in ggplot2.
The raster package is also an excellent package for analyzing/processing raster data. For large jobs, I find it easier to work with than ESRI tools, and it tends to run a lot faster than ESRI’s built-in tools (I haven’t compared it with Python).
Finally, the leaflet package in R allows you to create interactive maps as html widgets. These are often included in R shiny apps, but they can also be used in R Markdown with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive. Without an internet connection the map will show the default extent defined by the code.
The leaflet package is relatively easy to use for basic mapping. For more advanced features or further customization, you often end up having to code in JavaScript, the language the leaflet library was originally written in. There’s also a lot more online help available for the JavaScript version of the leaflet library than the R version, so if you’re really stumped about something, you may find your answer in the JavaScript help.
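As a minimal sketch of what leaflet looks like in R (this is not part of the training exercises, and the coordinates here are just an approximate point near ACAD):

```r
library(leaflet)

# Build an interactive map: a default basemap plus one labeled marker
acad_map <- leaflet() %>%
  addTiles() %>%                                   # default OpenStreetMap basemap
  setView(lng = -68.21, lat = 44.35, zoom = 10) %>%
  addMarkers(lng = -68.21, lat = 44.35, popup = "Acadia NP (approximate)")

acad_map  # renders as an HTML widget in the RStudio Viewer
```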
My two favorite features of sf are 1) the attribute table of a simple feature (sf’s equivalent of a shapefile) is a dataframe in R, and 2) sf objects are pipeable with tidyverse functions. That means that if you want to delete points from your layer, you can use dplyr’s filter() function to filter them out. The sf package will update the geometry of the layer to remove the points that were filtered out.
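Here is a minimal sketch of that workflow with made-up points (the site and count column names are hypothetical, and the coordinates are arbitrary UTM values):

```r
library(sf)
library(dplyr)

# Toy point layer built from a plain data.frame
pts <- data.frame(site  = c("A", "B", "C"),
                  count = c(0, 5, 2),
                  x = c(560000, 561000, 562000),
                  y = c(4910000, 4911000, 4912000))
pts_sf <- st_as_sf(pts, coords = c("x", "y"), crs = 26919)

# dplyr verbs work directly on the sf object; the geometry follows along
pts_keep <- pts_sf %>% filter(count > 0)
nrow(pts_keep)  # site "A" and its geometry were dropped
```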
To demonstrate the use of sf and tmap, I’m going to generate a simple GRTS sample using spsurvey, which now connects to sf instead of sp and rgdal. Then we’ll filter out points that don’t fall in a forest habitat to demonstrate how to work with sf data in R. Finally we’ll plot the results using tmap.
I wasn’t able to figure out how to post a shapefile that you could easily download in R with a url. I emailed them to you as data.zip the previous week. I also posted all of the files to our Scientists Team, which can be downloaded using the following link: https://doimspp.sharepoint.com/:f:/s/ScientistsTraining2022/EiXOTYV1l4RCk1sMr5yXhZUB1ZFEaAlAN4elvsYbBfYHRg?e=ktVy5n. To follow along, you’ll need to download (and unzip if you’re using the email attachment) the shapefiles and save them to your working directory.
Once those steps are completed, we’re ready to generate a GRTS sample and start making a map. Note that I’m using 6-digit hex colors (i.e., “#ffffff” is white) to define the fill color for each habitat type. To pick your own or see what these colors look like, check out htmlcolorcodes.com.
#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%
library(sf)
library(tmap)
library(spsurvey)
# Generate a random number for the seed
sample(1:100000, 1) #62051
set.seed(62051)
# Read in shapefiles from teams data folder
sara_bound1 <- st_read("./shapefiles/SARA_boundary_4269.shp")
sara_veg1 <- st_read("./shapefiles/SARA_Veg.shp")
# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83.
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE
# Quick plot
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary
# Intersect boundary and veg so they cover the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields
str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)
# Simplify vegetation types for easier plotting
dev <- c("1. Urban or Built-up Land", "11. Residential",
"12. Commercial and Services", "13. Industrial",
"14. Transportation, Communications, and Utilities",
"17. Other Urban or Built-up Land")
crop <- c("21. Cropland and Pasture",
"22. Orchards, Groves, Vineyards, and Nurseries",
"31. Herbaceous Rangeland")
shrubland <- c("32. Shrub and Brush Rangeland")
forest_up <- c("41. Deciduous Forest Land", "42. Evergreen Forest Land",
"43. Mixed Forest Land")
forest_wet <- c("61. Forested Wetland")
open_wet <- c("62. Nonforested wetland", "62. Nonforested Wetland")
water <- c("5. Water", "51. Streams and Canals", "53. Reservoirs")
unk <- "Unclassified"
# Create 2 fields in the veg attribute table: simp_veg, and fills
sara_veg <- sara_veg %>%
mutate(simp_veg = case_when(ANDERSONII %in% dev ~ "Developed",
ANDERSONII %in% crop ~ "Open field",
ANDERSONII %in% shrubland ~ "Shrublands",
ANDERSONII %in% forest_up ~ "Forest",
ANDERSONII %in% forest_wet ~ "Forested wetland",
ANDERSONII %in% open_wet ~ "Open wetland",
ANDERSONII %in% water ~ "Open water",
ANDERSONII %in% unk ~ "Unclassified",
TRUE ~ "Unknown"),
fill_col = case_when(simp_veg == "Developed" ~ "#D8D8D8",
simp_veg == "Open field" ~ "#f5f0b0",
simp_veg == "Shrublands" ~ "#F29839",
simp_veg == "Powerline ROW" ~ "#F9421D",
simp_veg == "Forest" ~ "#55785e",
simp_veg == "Forested wetland" ~ "#9577a6",
simp_veg == "Open wetland" ~ "#c497d4",
simp_veg == "Open water" ~ "#AFD0F2",
simp_veg == "Unclassified" ~ "#ffffff"))
Before moving to the next step, I wanted to show how easy it is to create simple features from dataframes that contain X/Y coordinates. We’ll read in a fake dataset in the GitHub repo for this training, and call it wetdat. The dataset contains fake species composition data for wetland monitoring sites in ACAD and includes X and Y coordinates. We’ll use dplyr functions to calculate the number of invasive and protected species in each site by year combination. Then we’ll make it a simple feature, and save it as a shapefile. Note that there are multiple formats you can save simple features as. I show the shapefile version, in case you find yourself going between R and ArcGIS.
library(dplyr)
# Import data
wetdat <- read.csv(
"https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")
# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>%
summarize(num_inv = sum(Invasive), num_prot = sum(Protected),
.groups = 'drop')
# Check first 6 rows of output
head(wetsum)
# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c("X_Coord", "Y_Coord"), crs = 26919)
# ACAD is UTM Zone 19N NAD83, hence the difference from SARA, which is 18N.
# Write sf to disk as shapefile
st_write(wetsum_sf, "ACAD_wetland_data.shp")
The spsurvey package has been updated to point to sf instead of rgdal. It’s a code-breaking change if you have old R scripts that generated GRTS samples. However, the process is even easier now, as you can see below. Here we’re just going to use the simplest of GRTS designs. The output from grts() has multiple slots. The one we want is sites_base, and you can see how we get that slot in the code below.
Note: While I followed the approach documented in the spsurvey vignette to generate reproducible GRTS samples, it does not appear to be working as I test it. Despite running set.seed(62051) after loading spsurvey and then running the code chunk below, each sample appears to be different.
sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)
The first step here is to spatially join the sara_grts layer to the sara_veg layer. We only care about sara_veg’s simp_veg field, so I used dplyr’s select() function to keep just that column. After joining, you should see a simp_veg field added to the end of the grts layer (it’s actually just to the left of geometry, which is always last). In the next step, we filter the points in the newly created grts_veg layer to only include points that fall in forest habitat.
# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)
# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c("Forest", "Forested wetland") # spelling must match simp_veg exactly
# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points
Now it’s time to plot. The code below may seem like a lot, but we’ll break it down piece by piece. The great thing about building maps this way is that you’re essentially building a template in code that you can borrow from and adapt for future maps. You can even turn it into a function that’s even easier to reuse in future maps. That’s beyond what we can cover here, but it’s another great benefit of making maps in R instead of ArcGIS.
The code below is broken into the various pieces that make up the map. The way tmap works is that first, you have to add the layer via tm_shape(), and then you specify how you want that layer to look, like tm_fill(), or tm_border(). Each piece has its own legend as well. This is similar to how you start each ggplot2 graph defining the data with ggplot(data, aes(x = xvar, y = yvar)), and then start adding geom_point(), or whatever else to define how the data are plotted. The difference with tmap is that every layer you want in the map has to be coded this way. Finally tm_layout() is similar to theme() in ggplot2, and is where you can customize the map layout.
for_legend makes a list of the habitat types in simp_veg and their associated colors. That saves time having to type them all over again, and potentially spelling them wrong.
tm_shape() in the map sets the projection and the bounding box. If you don’t set the bounding box, the map will show the largest extent of your layers. So if you have a roads layer at the state-level, your map will be zoomed at that extent, instead of the park boundary.
tm_fill() fills in colors, and tm_borders() only plots outlines. If you want both borders and fills, use tm_polygons() instead.
tm_add_legend() allows you to set the order that each legend appears in the map. Otherwise, they’ll appear in the order you specify the layers.
tm_symbols() allows you to change the symbol format in a similar way to ggplot2. We then added tm_text() to label each point using the numbers in the priority column of the grts_forest attribute table. The xmod and ymod arguments let you offset labels from the points horizontally or vertically. In this case, a negative xmod moves the label to the left, and a negative ymod moves the label below the point. The default location for labels is directly on top of the point.
tm_layout() code is where you can change the default settings of the figure, including font size, placement of the legend, and margins. The title name and position are also set in the tm_layout().
tmap_save() allows you to write the map to disk and to specify the height and width of the map. I prefer to write maps to disk to see what the size looks like before changing any of the layout and font size settings, because the figure on your disk will look different (and often better) than the plot view in your R Studio session.
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col))
sara_map <-
# Vegetation map
tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
tm_fill("fill_col") +
tm_add_legend(type = 'fill', labels = for_legend$simp_veg,
col = for_legend$fill_col, z = 3) +
# Park boundary
tm_shape(sara_bound) +
tm_borders('black', lwd = 2) +
tm_add_legend(type = 'line', labels = "Park Boundary", col = "black",
z = 2)+
# GRTS points
tm_shape(grts_forest) +
tm_symbols(shapes = 21, col = "#EAFF16", border.lwd = 0.5, size = 0.3) +
tm_text("priority", size = 0.9, xmod = -0.4, ymod = -0.4) +
tm_add_legend(type = 'symbol', labels = "GRTS points", shape = 21,
col = "#EAFF16", border.lwd = 1, size = 0.5,
z = 1) +
# Other map features
tm_compass(size = 2, type = 'arrow',
text.size = 1, position = c('left', 'bottom')) +
tm_scale_bar(text.size = 1.25, position = c("center", "bottom")) +
tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
outer.margins = 0,
legend.text.size = 1.25,
legend.just = 'right',
legend.position = c("right", "bottom"),
title = "Saratoga NHP GRTS points",
title.position = c("center", "top"))
tmap_save(sara_map, "SARA_GRTS.png", height = 10.5, width = 8,
units = 'in', dpi = 600, outer.margins = 0)
Here’s what sara_map looks like in your plot viewer:
sara_map
Here’s what the map looks like after it’s written to disk and the dimensions are set using tmap_save():
The last cool thing to show with tmap is that you can create interactive HTML widgets similar to what leaflet can do (coming next). With interactive mode, you can change baselayers, turn your layers off/on, and zoom to different extents. The legend in interactive mode is a bit limited (it only shows fills), but it’s still pretty cool.
tmap_mode("view") # turn interactive mode on
sara_map